pacman::p_load(sf, spdep, tmap, tidyverse, knitr, sp, reshape2, stplanr, httr)
Widya Tantiya Yutika
December 8, 2023
December 15, 2023
Urban mobility challenges encompass understanding the driving forces behind early morning commutes for city dwellers and evaluating the impact of removing public bus services along specific routes, presenting complex issues for transport operators and urban managers. Traditional commuter surveys, though effective, are costly and time-consuming. With the digitization of urban infrastructures, particularly public transportation, massive geospatial data sets are generated through technologies like GPS and SMART cards. However, the inability to efficiently utilize this data can hinder effective decision-making. This exercise aims to address two key issues: the lack of research on integrating diverse open data sources for policymaking and the insufficient exploration of geospatial data science for decision support.
The objective is to conduct a case study showcasing the potential of geospatial data science in integrating publicly available data to build spatial interaction models that explain the factors influencing the urban mobility patterns of public bus commuters.
Open Government Data
Passenger Volume by Origin Destination Bus Stops from LTA DataMall.
School Directory and Information from Data.gov.sg.
Instructor-curated Datasets for Educational Purpose
Open Government Data
Bus Stop Location, Train Station and Train Station Exit Point from LTA DataMall.
Master Plan 2019 Subzone Boundary from Data.gov.sg.
Instructor-curated Datasets for Educational Purpose
The specific tasks of this take-home exercise are as follows:
Derive an analytical hexagon data of 375m (this distance is the perpendicular distance between the centre of the hexagon and its edges) to represent the traffic analysis zone (TAZ).
With reference to the time intervals provided in the table below, construct an O-D matrix of commuter flows for a time interval of your choice by integrating Passenger Volume by Origin Destination Bus Stops and Bus Stop Location from LTA DataMall. The O-D matrix must be aggregated at the analytical hexagon level.
| Peak hour period | Bus tap on time |
|---|---|
| Weekday morning peak | 6am to 9am |
| Weekday afternoon peak | 5pm to 8pm |
| Weekend/holiday morning peak | 11am to 2pm |
| Weekend/holiday evening peak | 4pm to 7pm |
Display the O-D flows of the passenger trips by using appropriate geovisualisation methods (not more than 5 maps).
Describe the spatial patterns revealed by the geovisualisation (not more than 100 words per visual).
Assemble at least three propulsive and three attractiveness variables by using aspatial and geospatial data from publicly available sources.
Compute a distance matrix by using the analytical hexagon data derived earlier.
Calibrate spatial interactive models to determine factors affecting urban commuting flows at the selected time interval.
Present the modelling results by using appropriate geovisualisation and graphical visualisation methods. (Not more than 5 visuals)
With reference to the Spatial Interaction Model output tables, maps and data visualisation prepared, describe the modelling results. (not more than 100 words per visual).
Before I get started, I need to ensure that the sf, spdep, tmap, tidyverse, knitr, sp, reshape2, stplanr and httr packages are installed in my R environment.
sf : for importing and handling geospatial data in R,
spdep : for computing spatial weights, global and local spatial autocorrelation statistics,
tmap : for preparing cartographic-quality choropleth maps,
tidyverse : for wrangling attribute data in R; tidyverse already includes a collection of packages such as readr, ggplot2, dplyr, tibble, purrr, etc.,
knitr : for facilitating dynamic report generation in R Markdown documents,
sp : for handling spatial data classes and computing the distance matrix with spDists(),
reshape2 : for pivoting the distance matrix from wide to long format with melt(),
stplanr : for creating desire lines of O-D flows with od2line(), and
httr : for making HTTP requests to the OneMap API when geocoding school locations.
The code chunk below is used to ensure that the necessary R packages have been installed; any package that is not yet installed will be installed and then loaded into the R environment.
Next, I will import the Passenger Volume by Origin Destination Bus Stops data set, origin_destination_bus_202310.csv, downloaded from LTA DataMall for October 2023 into R by using read_csv() of the readr package. The output is of R dataframe class.
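The import step can be sketched as below; the two-row inline sample is illustrative, not the actual file. Reading the stop codes as character preserves their leading zeros.

```r
library(readr)

# Toy two-row sample standing in for origin_destination_bus_202310.csv;
# stop codes are read as character so leading zeros are preserved
csv_txt <- "ORIGIN_PT_CODE,DESTINATION_PT_CODE,TOTAL_TRIPS
04168,10051,3
80119,90079,5"
odbus_toy <- read_csv(I(csv_txt),
                      col_types = cols(.default = "c", TOTAL_TRIPS = "d"))
```

For the real data, the same read_csv() call simply takes the csv file path instead of the inline text.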
The code chunk below uses glimpse() of dplyr package to display the odbus tibble data tables.
Rows: 5,694,297
Columns: 7
$ YEAR_MONTH <chr> "2023-10", "2023-10", "2023-10", "2023-10", "2023-…
$ DAY_TYPE <chr> "WEEKENDS/HOLIDAY", "WEEKDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 7, 14, 14, 10, 20, 20,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "2028…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "2014…
$ TOTAL_TRIPS <dbl> 3, 5, 3, 5, 4, 1, 24, 2, 1, 7, 3, 2, 5, 1, 1, 1, 1…
From the glimpse() check above, it is shown that the ORIGIN_PT_CODE and DESTINATION_PT_CODE are in character type. Both of them need to be converted to factor type to work with categorical variables so that I can use them to georeference with bus stop location data.
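A minimal sketch of the conversion on a toy tibble (column names match the real data; the values are illustrative):

```r
library(dplyr)

# Convert both stop-code columns from character to factor
odbus_toy <- tibble(
  ORIGIN_PT_CODE      = c("04168", "80119"),
  DESTINATION_PT_CODE = c("10051", "90079")
) %>%
  mutate(across(c(ORIGIN_PT_CODE, DESTINATION_PT_CODE), as.factor))
```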
Next, I will confirm the data type changes using glimpse().
Rows: 5,694,297
Columns: 7
$ YEAR_MONTH <chr> "2023-10", "2023-10", "2023-10", "2023-10", "2023-…
$ DAY_TYPE <chr> "WEEKENDS/HOLIDAY", "WEEKDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 7, 14, 14, 10, 20, 20,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <fct> 04168, 04168, 80119, 80119, 44069, 20281, 20281, 1…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 20141, 20141, 1…
$ TOTAL_TRIPS <dbl> 3, 5, 3, 5, 4, 1, 24, 2, 1, 7, 3, 2, 5, 1, 1, 1, 1…
There are no duplicate records found in the odbus dataset.
For the purpose of this take-home exercise, I will extract commuting flows on Weekend/holiday morning peak which is between 11am to 2pm.
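The extraction can be sketched as below on toy rows, assuming "11am to 2pm" corresponds to tap-on hours 11, 12 and 13 in TIME_PER_HOUR:

```r
library(dplyr)

# Toy rows mimicking odbus; the real filter keeps WEEKENDS/HOLIDAY taps
# whose tap-on hour falls in 11, 12 or 13 (11am to before 2pm)
odbus_toy <- tibble(
  DAY_TYPE            = c("WEEKENDS/HOLIDAY", "WEEKDAY", "WEEKENDS/HOLIDAY"),
  TIME_PER_HOUR       = c(11, 12, 16),
  ORIGIN_PT_CODE      = c("04168", "04168", "80119"),
  DESTINATION_PT_CODE = c("10051", "10051", "90079"),
  TOTAL_TRIPS         = c(3, 5, 4)
)

odbus11_14 <- odbus_toy %>%
  filter(DAY_TYPE == "WEEKENDS/HOLIDAY",
         TIME_PER_HOUR >= 11,
         TIME_PER_HOUR < 14) %>%
  group_by(ORIGIN_PT_CODE, DESTINATION_PT_CODE) %>%
  summarise(TRIPS = sum(TOTAL_TRIPS), .groups = "drop")
```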
The code chunk below uses st_read() of the sf package to import the BusStop shapefile into R. The imported shapefile will be a simple features object of sf. Then, I use st_transform() of the sf package to convert the coordinates to EPSG code 3414, i.e. SVY21, the projected coordinate system for Singapore.
Reading layer `BusStop' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
There are 16 duplicated records with the same BUS_STOP_N but different geometry points. However, upon investigation, the coordinates are quite near to each other, which suggests these may be temporary bus stops. In view of this, I will remove the duplicated records and only keep the first occurrence using the code chunk below.
The code chunk below uses st_read() of the sf package to import the MPSZ-19 shapefile into R. The imported shapefile will be a simple features object of sf. Then, I use st_transform() of the sf package to convert the coordinates to EPSG code 3414, i.e. SVY21, the projected coordinate system for Singapore.
Reading layer `MPSZ-2019' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
A honeycomb layer is preferred over the coarse and irregular Master Plan 2019 Sub-zone GIS data set of URA because hexagons reduce sampling bias thanks to their low perimeter-to-area ratio and their ability to form an evenly spaced grid. Honeycomb grids are also well-suited for approximating circular areas, making them suitable for mapping Singapore's irregular edges.
The code chunk below uses st_make_grid() of the sf package to create a hexagonal (honeycomb) grid with a 375m perpendicular distance between the centre of each hexagon and its edges to represent the traffic analysis zone (TAZ). According to the R documentation, the cellsize is the distance between opposite edges, which is 2 times the perpendicular distance between the centre of a hexagon and its edges. Thus, for the purpose of this exercise, I will use a cellsize of 750m and set square=FALSE for a hexagonal grid. After doing so, I will create a grid_id for each hexagonal grid (2,541 hexagonal grids).
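The grid construction can be sketched on a toy 3km x 3km square (standing in for the study area; coordinates in SVY21 metres):

```r
library(sf)
library(dplyr)

# Toy study area: a 3km x 3km square in SVY21 (EPSG:3414)
area <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(3000, 0), c(3000, 3000), c(0, 3000), c(0, 0)
))), crs = 3414)

# cellsize = 750 m = 2 x 375 m apothem; square = FALSE gives hexagons
honeycomb <- st_make_grid(area, cellsize = 750, square = FALSE) %>%
  st_sf() %>%
  mutate(grid_id = row_number())
```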
Next, I will only retain the hexagons with at least 1 bus stop in it.
There are 834 hexagonal units with at least 1 bus stop.
Some hexagonal grids are densely packed, with a maximum of 19 bus stops, shown in yellow above.
In this section, I use st_intersection() from the sf package to overlay the bus stop points and hexagonal grids.
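The point-in-hexagon counting and the retain-if-occupied filter can be sketched together on toy data (the same pattern applies later to schools, HDB and the other point layers):

```r
library(sf)
library(dplyr)

# Toy grid and toy bus stop points (SVY21, metres)
area <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(3000, 0), c(3000, 3000), c(0, 3000), c(0, 0)
))), crs = 3414)
honeycomb <- st_make_grid(area, cellsize = 750, square = FALSE) %>%
  st_sf() %>%
  mutate(grid_id = row_number())
stops <- st_sfc(st_point(c(400, 400)), st_point(c(420, 430)),
                st_point(c(2200, 2100)), crs = 3414)

# Count stops per hexagon, then keep only hexagons with at least one
honeycomb$busstop_count <- lengths(st_intersects(honeycomb, stops))
honeycomb_with_busstop <- honeycomb %>% filter(busstop_count > 0)
```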
In this section, I will construct a matrix for Origin and Destination with the Trips counts on each combinations.
First, I will append “ORIGIN_GRID_ID” and “ORIGIN_LOC_DESC” from busstop_honeycomb data frame onto odbus11_14 data frame using the code chunk below.
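The append step is a left join on the bus stop code; a toy sketch (column names follow the real tables, values are illustrative):

```r
library(dplyr)

# Toy trip records and a toy bus-stop-to-hexagon lookup table
odbus11_14_toy <- tibble(
  ORIGIN_BS = c("04168", "80119"),
  DESTIN_BS = c("10051", "90079"),
  TRIPS     = c(3, 5)
)
busstop_honeycomb_toy <- tibble(
  BUS_STOP_N      = c("04168", "80119", "10051"),
  ORIGIN_GRID_ID  = c(23, 44, 46),
  ORIGIN_LOC_DESC = c("BEF TUAS STH AVE 14", "SEE HUP SENG", "YONG NAM")
)

# Append the origin hexagon id and description to each trip record
od_data <- odbus11_14_toy %>%
  left_join(busstop_honeycomb_toy, by = c("ORIGIN_BS" = "BUS_STOP_N"))
```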
Next, I will check for duplicated records on od_data using the chunk code below.
The above code chunk shows that there are no duplicate record found.
Next, I will update od_data data frame by performing another left join with busstop_honeycomb to get the “DESTIN_GRID_ID” and “DESTIN_LOC_DESC”.
missing_values <- colSums(is.na(od_data))
# Print the columns with missing values
print(missing_values)
      ORIGIN_BS       DESTIN_BS           TRIPS ORIGIN_LOC_DESC  ORIGIN_GRID_ID
              0               0               0            2720            2199
DESTIN_LOC_DESC  DESTIN_GRID_ID
           2849            2257
The missing grid_id for origin and destination bus stops may be due to the Bus Stop Location data being outdated: it was uploaded in July 2023, whereas the Passenger Volume by Origin Destination Bus Stops data was collected in October 2023.
So, I will remove the rows with NA and calculate the number of trips during the weekend/holiday morning peak hour by grouping by “ORIGIN_GRID_ID” and “DESTIN_GRID_ID” and summing all the trips for each combination of hexagonal grid ids using the code chunk below.
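A toy sketch of this drop-NA-then-aggregate step:

```r
library(dplyr)
library(tidyr)

od_data_toy <- tibble(
  ORIGIN_GRID_ID = c(23, 23, NA, 44),
  DESTIN_GRID_ID = c(44, 44, 46, 23),
  TRIPS          = c(3, 5, 2, 4)
)

# Drop rows whose origin/destination could not be matched to a hexagon,
# then sum trips per ordered grid pair
od_data_agg <- od_data_toy %>%
  drop_na(ORIGIN_GRID_ID, DESTIN_GRID_ID) %>%
  group_by(ORIGIN_GRID_ID, DESTIN_GRID_ID) %>%
  summarise(MORNING_PEAK = sum(TRIPS), .groups = "drop")
```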
I will check the distribution of the Morning Peak trips using summary() as shown below.
I will save the output into an rds file format.
First, as_Spatial() of the sf package will be used to convert honeycomb_with_busstop from an sf tibble data frame to a SpatialPolygonsDataFrame sp object as shown in the code chunk below. This method is used because it is faster to compute the distance matrix using sp than the sf package.
class : SpatialPolygonsDataFrame
features : 834
extent : 3595.122, 48595.12, 26049.09, 53545.39 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 2
names : grid_id, busstop_count
min values : 23, 1
max values : 2505, 19
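The sf-to-sp conversion above can be sketched on a single toy polygon (toy data, not the actual honeycomb):

```r
library(sf)
library(sp)

# One toy polygon feature; as_Spatial() turns the sf object into a
# SpatialPolygonsDataFrame that sp functions can use directly
poly <- st_sf(
  grid_id = 1,
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(750, 0), c(750, 750), c(0, 750), c(0, 0)
  ))), crs = 3414)
)
poly_sp <- as_Spatial(poly)
```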
Next, spDists() of the sp package will be used to compute the Euclidean distance between the centroids of the hexagons. When longlat is set to FALSE, spDists() returns a full matrix of distances in the units of the projected coordinates (metres here); when longlat is set to TRUE, it returns distances in kilometres. In my case, since there are 834 hexagons, a matrix of 834 x 834 will be created, and I will print out the first 8 rows using head().
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.000 750.000 3269.174 1500.000 2704.163 3968.627 1299.038 2250.000
[2,] 750.000 0.000 2598.076 750.000 1984.313 3269.174 750.000 1500.000
[3,] 3269.174 2598.076 0.000 1984.313 750.000 750.000 2704.163 1500.000
[4,] 1500.000 750.000 1984.313 0.000 1299.038 2598.076 750.000 750.000
[5,] 2704.163 1984.313 750.000 1299.038 0.000 1299.038 1984.313 750.000
[6,] 3968.627 3269.174 750.000 2598.076 1299.038 0.000 3269.174 1984.313
[7,] 1299.038 750.000 2704.163 750.000 1984.313 3269.174 0.000 1299.038
[8,] 2250.000 1500.000 1500.000 750.000 750.000 1984.313 1299.038 0.000
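A self-contained sketch of the distance computation on three toy centroids:

```r
library(sp)

# Toy centroids in projected metres; longlat = FALSE returns a full
# symmetric matrix of Euclidean distances in the same units
centroids <- matrix(c(0,   0,
                      750, 0,
                      0,   750),
                    ncol = 2, byrow = TRUE)
dist <- spDists(centroids, longlat = FALSE)
```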
From the matrix above, it is noticed that both the column and row headers are not labelled by grid_id. So, I label the headers by first creating a list sorted according to the distance matrix by grid_id, and then attaching the grid_id to the rows and columns of the distance matrix for matching.
grid_id_desc <- honeycomb_with_busstop$grid_id
colnames(dist) <- paste0(grid_id_desc)
rownames(dist) <- paste0(grid_id_desc)
head(dist, n=c(8,8))
         23       44       46       66       67       68       86       87
23 0.000 750.000 3269.174 1500.000 2704.163 3968.627 1299.038 2250.000
44 750.000 0.000 2598.076 750.000 1984.313 3269.174 750.000 1500.000
46 3269.174 2598.076 0.000 1984.313 750.000 750.000 2704.163 1500.000
66 1500.000 750.000 1984.313 0.000 1299.038 2598.076 750.000 750.000
67 2704.163 1984.313 750.000 1299.038 0.000 1299.038 1984.313 750.000
68 3968.627 3269.174 750.000 2598.076 1299.038 0.000 3269.174 1984.313
86 1299.038 750.000 2704.163 750.000 1984.313 3269.174 0.000 1299.038
87 2250.000 1500.000 1500.000 750.000 750.000 1984.313 1299.038 0.000
Next, I will pivot the distance matrix into a long table keyed by the row and column grid_id, using melt() of the reshape2 package to convert wide-format data to long-format data.
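The pivot can be sketched on a toy 2 x 2 labelled distance matrix:

```r
library(reshape2)

# Toy labelled distance matrix pivoted to long format:
# one row per (origin grid_id, destination grid_id) pair
m <- matrix(c(0, 750, 750, 0), nrow = 2,
            dimnames = list(c("23", "44"), c("23", "44")))
distPair <- melt(m, varnames = c("Var1", "Var2"), value.name = "dist")
```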
For reference : https://www.statology.org/long-vs-wide-data/

Var1 Var2 dist
1 23 23 0.000
2 44 23 750.000
3 46 23 3269.174
4 66 23 1500.000
5 67 23 2704.163
6 68 23 3968.627
7 86 23 1299.038
8 87 23 2250.000
9 88 23 3436.932
10 89 23 4683.748
There are 695,556 rows in distPair, which is 834 x 834, the number of ordered pairs (permutations) of hexagons.
From the summary of “dist” above, it is noticed that there are distances of 0; these are intra-zonal distances, meaning that the trips start and end in the same grid id. Since the minimum inter-zonal distance is 750m, the intra-zonal distance should be set to a value below 750/2 = 375m. So, I choose a value of 350m for the intra-zonal distance.
Var1 Var2 dist
Min. : 23 Min. : 23 Min. : 750
1st Qu.: 871 1st Qu.: 871 1st Qu.: 8352
Median :1324 Median :1324 Median :13332
Mean :1269 Mean :1269 Mean :14162
3rd Qu.:1688 3rd Qu.:1688 3rd Qu.:18929
Max. :2505 Max. :2505 Max. :44680
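The replacement of zero intra-zonal distances with the chosen 350m constant can be sketched as (toy values):

```r
library(dplyr)

distPair_toy <- data.frame(dist = c(0, 750, 1984.313))

# Replace the zero intra-zonal distances with the chosen 350 m constant
distPair_toy <- distPair_toy %>%
  mutate(dist = if_else(dist == 0, 350, dist))
```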
Next, I will rename Var1 and Var2 into the origin and destination grid_id.
In this section, I will rename my od_data1 to flow_data.
# A tibble: 6 × 5
# Groups: ORIGIN_GRID_ID [2]
ORIGIN_GRID_ID DESTIN_GRID_ID MORNING_PEAK ORIGIN_DESC DESTIN_DESC
<int> <int> <dbl> <chr> <chr>
1 44 128 4 BEF TUAS STH AVE 14 AFT TUAS ST…
2 44 129 2 BEF TUAS STH AVE 14 BEF TUAS ST…
3 44 175 20 BEF TUAS STH AVE 14 TUAS WEST R…
4 46 67 2 SEE HUP SENG BEF TUAS ST…
5 46 111 6 YONG NAM AFT TUAS ST…
6 46 134 73 SEE HUP SENG, YONG NAM TUAS LK STN
The code chunk below is used to add two new fields in flow_data dataframe.
‘FlowNoIntra’ : 0 for intra-zonal flow and MORNING_PEAK for inter-zonal flow
‘offset’ : a very small number (i.e. 0.000001) for intra-zonal flow and 1 for inter-zonal flow
I will create 2 separate data frames for intra and inter zonal flow.
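The two new fields and the intra/inter split can be sketched together on toy rows:

```r
library(dplyr)

flow_toy <- tibble(
  ORIGIN_GRID_ID = c(111, 44),
  DESTIN_GRID_ID = c(111, 128),
  MORNING_PEAK   = c(1, 4)
)

# FlowNoIntra zeroes out intra-zonal flows; offset is near-zero for them
flow_toy <- flow_toy %>%
  mutate(FlowNoIntra = if_else(ORIGIN_GRID_ID == DESTIN_GRID_ID, 0, MORNING_PEAK),
         offset      = if_else(ORIGIN_GRID_ID == DESTIN_GRID_ID, 0.000001, 1))

# Split into intra-zonal (same grid) and inter-zonal (different grid) flows
intra_zonal_flow <- flow_toy %>% filter(ORIGIN_GRID_ID == DESTIN_GRID_ID)
inter_zonal_flow <- flow_toy %>% filter(ORIGIN_GRID_ID != DESTIN_GRID_ID)
```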
Out of a total of 62,176 flow records, there are 568 intra-zonal flows and 61,608 inter-zonal flows.
# A tibble: 568 × 7
# Groups: ORIGIN_GRID_ID [568]
ORIGIN_GRID_ID DESTIN_GRID_ID MORNING_PEAK ORIGIN_DESC DESTIN_DESC
<int> <int> <dbl> <chr> <chr>
1 111 111 1 "TUAS TECHNOLOGY PK" AFT TUAS S…
2 133 133 4 "BEF TUAS WEST AVE, A… AFT PIONEE…
3 150 150 1 "OPP SLS" BEF TUAS S…
4 154 154 9 "AFT TUAS WEST RD STN" ROTARY ENG…
5 175 175 8 "OPP NISSIN TPT, OPP … TUAS WEST …
6 176 176 13 "TUAS TER, BEF TUAS W… OPP TUAS L…
7 216 216 19 "TUAS CRES STN EXIT B… TUAS CRES …
8 217 217 13 "AFT TUAS AVE 7, SMC … OPP SMC MF…
9 237 237 1 "BEF TUAS AMENITY CTR" TRU-MARINE
10 278 278 1 "KEPPEL SHIPYARD" DUNDEE MAR…
# ℹ 558 more rows
# ℹ 2 more variables: FlowNoIntra <dbl>, offset <dbl>
Before we can join “inter_zonal_flow” and “distPair”, I will convert the data type of the ORIGIN_GRID_ID and DESTIN_GRID_ID fields of both dataframes into factor.
Now, left_join() of dplyr will be used to combine “inter_zonal_flow” dataframe and distPair dataframe. The output is called flow_data1.
flow_data1 <- inter_zonal_flow %>%
left_join (distPair,
by = c("ORIGIN_GRID_ID" = "ORIGIN_GRID_ID",
"DESTIN_GRID_ID" = "DESTIN_GRID_ID"))
print(head(flow_data1))
# A tibble: 6 × 8
# Groups: ORIGIN_GRID_ID [2]
ORIGIN_GRID_ID DESTIN_GRID_ID MORNING_PEAK ORIGIN_DESC DESTIN_DESC FlowNoIntra
<fct> <fct> <dbl> <chr> <chr> <dbl>
1 44 128 4 BEF TUAS S… AFT TUAS S… 4
2 44 129 2 BEF TUAS S… BEF TUAS S… 2
3 44 175 20 BEF TUAS S… TUAS WEST … 20
4 46 67 2 SEE HUP SE… BEF TUAS S… 2
5 46 111 6 YONG NAM AFT TUAS S… 6
6 46 134 73 SEE HUP SE… TUAS LK STN 73
# ℹ 2 more variables: offset <dbl>, dist <dbl>
In this section, I will create a desire line of inter-zonal flow (flow_data1) by using od2line of stplanr package.
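The desire line construction can be sketched on two toy zones and one flow; od2line() links the zone centroids into one line per O-D pair (toy data, not the actual grids):

```r
library(sf)
library(stplanr)

# Two toy square zones and one flow between them
z1 <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
z2 <- st_polygon(list(rbind(c(2, 0), c(3, 0), c(3, 1), c(2, 1), c(2, 0))))
zones <- st_sf(grid_id = c("A", "B"), geometry = st_sfc(z1, z2, crs = 3414))
flows <- data.frame(ORIGIN_GRID_ID = "A", DESTIN_GRID_ID = "B",
                    MORNING_PEAK = 10)

# One LINESTRING from zone A's centroid to zone B's centroid
desire_lines <- od2line(flow = flows, zones = zones)
```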
To visualise the resulting desire lines, the code chunk below is used. In the map below, I have filtered out trips of fewer than 5,000 for ease of analysis. A thicker line width indicates a flow with more trips, while the length of a desire line reflects the distance of the inter-zonal flow.
tmap_options(check.and.fix = TRUE)
tmap_mode("plot")
tm_shape(mpsz)+
tm_polygons(alpha=0.5)+
tm_text(text = "SUBZONE_N", size = 0.2) +
tm_shape(honeycomb_with_busstop) +
tm_polygons(col = "lightblue", alpha=0.5) +
flowLine %>% filter(MORNING_PEAK >= 5000) %>%
tm_shape() +
tm_lines(lwd = "MORNING_PEAK",
style = "quantile",
scale = c(0.5, 2, 4, 6, 8, 10, 12),
n = 6,
alpha = 0.3,
col="red")+
tm_layout(main.title = 'Static Map: OD Flow On Weekends/Holiday Morning Peak Hour' ,
main.title.position = "center",
main.title.size = 1.0,
legend.width=1)
From the map above, I observed a noticeable concentration of public bus flow between the Woodlands area and Johor Bahru (the hexagon outside of the Singapore boundary), as indicated by the thick band extending between the two during the weekend/holiday morning peak hour. We will further analyse whether the flow is from Singapore or from Johor Bahru.
Simple feature collection with 2 features and 8 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 20470.12 ymin: 47266.71 xmax: 20845.12 ymax: 49215.27
Projected CRS: SVY21 / Singapore TM
ORIGIN_GRID_ID DESTIN_GRID_ID MORNING_PEAK ORIGIN_DESC
1 962 984 43418 W'LANDS CHECKPT
2 984 983 31087 JOHOR BAHRU CHECKPT
DESTIN_DESC FlowNoIntra offset dist
1 JOHOR BAHRU CHECKPT 43418 1 1984.313
2 W'LANDS CHECKPT 31087 1 1299.038
geometry
1 LINESTRING (20470.12 47266....
2 LINESTRING (20845.12 49215....
The public bus flows suggest that the predominant movement is from Woodlands Checkpoint to Johor Bahru Checkpoint, with around 43k trips. The flow from Johor Bahru Checkpoint to Singapore is also comparably high, with around 31k trips. This commuter flow could be attributed to individuals engaging in leisure activities, entertainment, tourism, and family/friend visits in Johor/Singapore; some even take advantage of cost-effective grocery shopping in Johor.
Min. 1st Qu. Median Mean 3rd Qu. Max.
750 750 750 1056 1299 2704
Furthermore, I observed that during the weekend morning peak, the predominant public bus flows span a relatively short distance, with a maximum of 2.7 kilometres. This phenomenon may be attributed to the fact that weekends are typically considered rest days, leading to shorter commutes or more localised activities within a confined radius. Alternatively, individuals may opt for other modes of transport such as the MRT for longer journeys. In view of activities within a confined radius, I will also analyse intra-zonal flows.
intra_flow_map <- honeycomb_with_busstop %>%
inner_join(intra_zonal_flow, by = c("grid_id" = "ORIGIN_GRID_ID")) %>%
rename("LOC_DESC"="ORIGIN_DESC") %>%
select("grid_id","LOC_DESC","MORNING_PEAK","busstop_count","area_honeycomb_grid")
intra_flow_map <- intra_flow_map %>%
mutate(short_loc_desc = str_extract(LOC_DESC, "^[^,]+"))
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.5)+
tm_shape(intra_flow_map) +
tm_borders() +
tm_fill("MORNING_PEAK", palette = "Blues",
title = "Morning Peak",
breaks= c(1,1000,5000,8000,11000,14500),
legend.show = TRUE,
popup.vars = c("Location Description" ="short_loc_desc",
"Trips Count"="MORNING_PEAK",
"Bus Stop Count"="busstop_count"
)) +
tm_view(set.zoom.limits =c(11,17))+
tm_layout(main.title = 'Intra Flow On Weekends/Holiday Morning Peak Hour' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
legend.width=1)
From the map above, the intra-zonal public bus flows occur in the vicinity of Admiralty Int and BLK 126 CP (Bukit Merah and Tiong Bahru area). This suggests that a significant volume of trips occurs within a short distance in these areas.
The code chunk below produces an interactive map for further insights.
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.5)+
tm_shape(honeycomb_with_busstop) +
tm_polygons(col = "lightblue", alpha=0.5) +
flowLine %>% filter(MORNING_PEAK >= 5000) %>%
tm_shape() +
tm_lines(lwd = "MORNING_PEAK",
style = "quantile",
scale = c(2, 4, 6, 8, 10, 12, 14),
n = 6,
alpha = 0.3,
col="red",
popup.vars=c("Origin Description"="ORIGIN_DESC",
"Destination Description" ="DESTIN_DESC",
"Trips Count" ="MORNING_PEAK"))+
tm_view(set.zoom.limits =c(11,17))+
tm_layout(main.title = 'Interactive Map: OD Flow On Weekends/Holiday Morning Peak Hour' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold')
From the map above, I also noticed some large volumes flowing to hexagonal grids with bus interchanges or MRT/LRT stations such as Boon Lay, Choa Chu Kang, Bukit Panjang, Yew Tee, Clementi, Bishan, Ang Mo Kio, Serangoon, Toa Payoh, Sengkang, Bedok, Tampines and Pasir Ris, which can be one of the factors to be considered for my Spatial Interaction Model later on.
First, I will import the School General Information in csv format, which is downloaded from Data.gov.sg under School Directory and Information. This dataset consists of the school name and postal code for MOE Kindergartens, Primary Schools, Secondary Schools, Junior Colleges, Mixed Levels and the Centralised Institute for pre-university. In total, there are 346 records. Since the data set does not include polytechnics and ITE, I will collate that information from moe.gov.sg. Currently, there are 8 polytechnics/ITEs. Next, I will merge the 2 datasets, so in total there are 354 schools.
Next, I will use the OneMap API to geocode the school locations by retrieving the longitude and latitude coordinates using the postal_code field.
#|eval: false
url <- "https://www.onemap.gov.sg/api/common/elastic/search"
postcodes <- merged_school$'postal_code'
found <- data.frame()
not_found <- data.frame()
for (postcode in postcodes){
query <- list('searchVal'=postcode,'returnGeom'='Y','getAddrDetails'='Y','pageNum'='1')
res<-GET(url, query=query)
if((content(res)$found)!=0){
found <-rbind(found, data.frame(content(res))[4:13])
} else{
not_found = data.frame(postcode)
}
}
merged = merge(merged_school, found, by.x='postal_code', by.y='results.POSTAL', all=TRUE)
It is noticed that the number of rows has increased, which suggests duplicated rows; upon checking, there are 8 duplicated rows, so I will remove them.
Then, I will check whether there is any postal code that cannot be geocoded using the code chunk below.
There is 1 postal code that cannot be geocoded, which is 677741. In this case, I will manually extract the longitude (103.7651) and latitude (1.389279) for this specific postal code and fill in the NA values with the code chunk below.
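The manual fill can be sketched as below; results.LONGITUDE and results.LATITUDE are assumed column names from the flattened OneMap response, and the second row is a toy record:

```r
library(dplyr)

# Toy geocoding output with one failed postal code (677741)
schools_toy <- tibble(
  postal_code       = c("677741", "139951"),
  results.LONGITUDE = c(NA, "103.8"),
  results.LATITUDE  = c(NA, "1.30")
)

# Fill in the manually extracted coordinates for the failed postal code
schools_toy <- schools_toy %>%
  mutate(results.LONGITUDE = if_else(postal_code == "677741",
                                     "103.7651", results.LONGITUDE),
         results.LATITUDE  = if_else(postal_code == "677741",
                                     "1.389279", results.LATITUDE))
```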
I will then write the merged file into a csv file.
Next, I will import the schools final data set and tidy it up (i.e. renaming the column headers) and only extract necessary fields using the code chunk below.
Then, I will convert the school coordinates into Singapore Projected Coordinate System SVY21 after converting to sf object.
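The lon/lat-to-SVY21 conversion can be sketched on a toy record (hypothetical school name and column names for illustration):

```r
library(sf)
library(dplyr)

# Toy school with WGS84 lon/lat, converted to an sf point then to SVY21
schools_df <- data.frame(school = "TOY SCH",
                         longitude = 103.7651, latitude = 1.389279)
schools_sf <- st_as_sf(schools_df,
                       coords = c("longitude", "latitude"),
                       crs = 4326) %>%
  st_transform(crs = 3414)
```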
After that, I will count the number of schools in each hexagon using st_intersects() of the sf package.
Next, I will check the school_count distributions using summary().
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4185 1.0000 4.0000
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>% filter(SCHOOL_COUNT>0)) +
tm_borders() +
tm_fill("SCHOOL_COUNT", palette = "Blues",
title = "School Count",
breaks= c(1,2,3,4,5),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'SCHOOL COUNT on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
legend.width=1)
From the map above, it is noticeable that schools are spread across Singapore; some areas like Punggol and Sengkang have more schools than others, while the CBD and Tuas areas have fewer schools.
I will import the hdb data in csv format provided and collated by Prof. Kam. This data set is the geocoded version of the HDB Property Information data from Data.gov.sg. The data set is prepared using September 2021 data, consisting of 12,442 records with 37 columns.
Next, I will convert the longitude and latitude from WGS84 to the Singapore Projected Coordinate System SVY21 using st_transform().
Next, I will check for duplicated hdb.
There are no duplicated hdb records, so next, I will count the dwelling units in each hexagonal grid.
Next, I will check the distribution of dwelling units in Singapore.
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 955 2364 2727 4077 7946 436
Next, I will fill NA with 0, meaning the hexagonal grid is not a residential area.
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>% filter(HDB_DWELLING_UNIT>0)) +
tm_borders() +
tm_fill("HDB_DWELLING_UNIT", palette = "Blues",
title = "HDB Dwelling Units Count",
breaks= c(1,1000,2000,4000,6000,7946),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'HDB Dwelling Units on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
legend.width=1)
From the map above, HDB blocks are spread across Singapore with high concentrations in Punggol, Sengkang, Tampines, Bedok, Woodlands, Yishun, Choa Chu Kang and Jurong West.
This business data is curated by Prof. Kam, consisting of the locations of business establishments in Singapore. It contains 6,550 point coordinates of business locations. The shapefile is already in SVY21, so there is no need to do st_transform().
Reading layer `Business' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 6550 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3669.148 ymin: 25408.41 xmax: 47034.83 ymax: 50148.54
Projected CRS: SVY21 / Singapore TM
Next, I will check for duplicated business.
Apparently there are 2 duplicate businesses, so I will remove the duplicated records and keep the first occurrence.
Next, I will count the number of businesses in each hexagonal grid.
Next, I will check the distribution of BUSINESS_COUNT.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 7.272 7.000 97.000
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>% filter(BUSINESS_COUNT>0)) +
tm_borders() +
tm_fill("BUSINESS_COUNT", palette = "Blues",
title = "Business Count",
breaks= c(1,10,20,30,50,97),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'BUSINESS COUNT on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
legend.width=1)
From the map above, it is observed that certain regions, especially industrial areas like the Jurong West to Tuas area, Woodlands Industrial Area, Kaki Bukit Industrial Estate and Loyang Industrial Estate, exhibit a higher concentration of businesses. The CBD/Downtown area also has a high number of businesses as it is the main commercial area in Singapore.
This entertainment data is curated by Prof. Kam, consisting of 114 point coordinates of entertainment venues like cinemas, theatres and art centres in Singapore. The shapefile is already in SVY21, so there is no need to do st_transform().
Reading layer `entertn' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 114 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 10809.34 ymin: 26528.63 xmax: 41600.62 ymax: 46375.77
Projected CRS: SVY21 / Singapore TM
Next, I will check for duplicated entertainment.
Apparently there are 2 duplicated entertainment records, so I will remove them and keep the first occurrence.
Next, I will count the number of entertainments in each hexagonal grid.
Next, I will check the distribution of ENTERTAINMENT_COUNT.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.1331 0.0000 9.0000
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>% filter(ENTERTAINMENT_COUNT>0)) +
tm_borders() +
tm_fill("ENTERTAINMENT_COUNT", palette = "Blues",
title = "Entertainment Count",
breaks= c(1,3,5,7,9),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'ENTERTAINMENT COUNT on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
legend.width=1)
From the map above, it is observed that the CBD area, especially from City Hall to Bugis, exhibits a higher concentration of entertainment venues in Singapore.
This F&B data is curated by Prof. Kam, consisting of 1,919 point coordinates of F&B outlets like cafes, restaurants, bars, clubs, karaoke lounges and pubs in Singapore. The shapefile is already in SVY21, so there is no need to do st_transform().
Reading layer `F&B' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 1919 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6010.495 ymin: 25343.27 xmax: 45462.43 ymax: 48796.21
Projected CRS: SVY21 / Singapore TM
Next, I will check for duplicated F&B.
There are no duplicated F&B records, so next, I will count the number of F&B outlets in each hexagonal grid.
Next, I will check the distribution of FNB_COUNT.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 2.204 1.000 133.000
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>%
filter(FNB_COUNT>0)) +
tm_borders() +
tm_fill("FNB_COUNT",
palette = "Blues",
title = "FNB Count",
breaks= c(1,10,20,30,50,70,100,133),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'FNB COUNT on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
legend.width=1)
From the map above, similar to entertainment, it is observed that the CBD area, especially from Tanjong Pagar to Bugis, exhibits a higher concentration of F&B outlets in Singapore.
This Financial Service data is curated by Prof. Kam, consisting of 3,320 point coordinates of financial service locations like banks and money changers in Singapore. The shapefile is already in SVY21, so there is no need to do st_transform().
Reading layer `FinServ' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3320 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4881.527 ymin: 25171.88 xmax: 46526.16 ymax: 49338.02
Projected CRS: SVY21 / Singapore TM
Next, I will check for duplicated financial services.
Apparently there are 524 duplicate financial services, so I will remove the duplicated records and keep the first occurrence. So, in total there are 3,058 financial services.
Next, I will count the number of financial services in each hexagonal grid.
Next, I will check the distribution of FINSERV_COUNT.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 3.613 4.000 130.000
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>%
filter(FINSERV_COUNT>0)) +
tm_borders() +
tm_fill("FINSERV_COUNT",
palette = "Blues",
title = "Financial Service Count",
          breaks= c(1,10,20,30,50,70,100,130),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'Financial Service COUNT on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
          legend.width=1)
From the map above, similar to F&B, it is observed that the CBD area, especially the Tanjong Pagar to Bugis stretch, exhibits a higher concentration of Financial Services in Singapore.
This Leisure and Recreation data is curated by Prof. Kam, consisting of 1,217 point coordinates of Leisure and Recreation area like gallery, museum, sport center, park, fitness, playground, studio in Singapore. The shapefile is already in SVY21 form so there is no need to do st_transform().
Reading layer `Liesure&Recreation' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 1217 features and 30 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6010.495 ymin: 25134.28 xmax: 48439.77 ymax: 50078.88
Projected CRS: SVY21 / Singapore TM
Next, I will check for duplicated Leisure and Recreation.
There are no duplicated Leisure and Recreation points, so next I will count the number of Leisure and Recreation points in each hexagonal grid.
Next, I will check the distribution of LEIREC_COUNT.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 1.307 1.000 41.000
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>%
filter(LEIREC_COUNT>0)) +
tm_borders() +
tm_fill("LEIREC_COUNT",
palette = "Blues",
title = "Leisure and Recreation Count",
breaks= c(1,10,20,30,41),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'Leisure and Recreation Count on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
          legend.width=1)
From the map above, it is observed that the Downtown area exhibits a higher concentration of Leisure and Recreation points in Singapore.
This Retails data is curated by Prof. Kam, consisting of 37,635 point coordinates of retail locations, such as retail shops, in Singapore. The shapefile is already in SVY21 form so there is no need to do st_transform().
Reading layer `Retails' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 37635 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4737.982 ymin: 25171.88 xmax: 48265.04 ymax: 50135.28
Projected CRS: SVY21 / Singapore TM
Next, I will check for duplicated Retails.
There are 347 duplicated Retails records, so I will remove the duplicates and keep the first occurrence of each. This leaves 37,460 Retails in total.
Next, I will count the number of Retails in each hexagonal grid.
Next, I will check the distribution of RETAILS_COUNT.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 8.00 43.83 36.00 1669.00
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>% filter(RETAILS_COUNT>0)) +
tm_borders() +
tm_fill("RETAILS_COUNT",
palette = "Blues",
title = "Retails Count", breaks= c(1,100,500,1000,1500,1669),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'Retails Count on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
          legend.width=1)
From the map above, it is observed that the Bugis area exhibits a higher concentration of Retails in Singapore.
This Train Station data is downloaded from LTA DataMall, consisting of 220 polygon features of rapid transit system (MRT and LRT) stations in Singapore. The layer is in SVY21 but without an EPSG code assigned, so I use st_transform() of the sf package to convert it to EPSG code 3414 (SVY21 / Singapore TM).
train_station_sf <- st_read(dsn="data/geospatial", layer="RapidTransitSystemStation") %>%
  st_transform(crs = 3414)
Reading layer `RapidTransitSystemStation' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.1177 0.0000 6.0000
This MRT Station Exit data is downloaded from LTA DataMall, consisting of 565 point coordinates of MRT Exits in Singapore. Then, I use st_transform() of the sf package to convert the coordinates to EPSG code 3414 (SVY21, the projected coordinate system for Singapore).
MRT_exit_sf <- st_read(dsn="data/geospatial", layer="Train_Station_Exit_Layer") %>%
  st_transform(crs = 3414)
Reading layer `Train_Station_Exit_Layer' from data source
`W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 565 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
Next, I will check for duplicated MRT Exits.
There are 17 duplicated MRT Exit records, so I will remove the duplicates and keep the first occurrence of each. This leaves 556 MRT Exits in total.
Next, I will count the number of MRT Exits in each hexagonal grid.
Next, I will check the distribution of MRT Exits.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.6619 0.0000 13.0000
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(mpsz)+
tm_polygons(alpha=0.3)+
tm_shape(honeycomb_with_busstop%>% filter(MRT_EXIT_COUNT>0)) +
tm_borders() +
tm_fill("MRT_EXIT_COUNT",
palette = "Blues",
title = "MRT Exit Count", breaks= c(1,4,7,10,13),
legend.show = TRUE,
alpha=0.7) +
tm_view(set.zoom.limits =c(11,14))+
tm_layout(main.title = 'MRT Exits Count on Hexagonal Grid' ,
main.title.position = "center",
main.title.size = 1.0,
main.title.fontface = 'bold',
          legend.width=1)
From the map above, it is observed that the Downtown area exhibits a higher concentration of MRT Exits, with a maximum of 13 in a single hexagon.
In this section, I will append all propulsive (origin) and attraction (destination) variables to my flow_data1 and create a new data frame, flow_data_tidy.
First, I will convert grid_id of honeycomb_with_busstop to factor format to match the grid_id keys (ORIGIN_GRID_ID and DESTIN_GRID_ID) of flow_data1.
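The conversion step can be sketched as below (a sketch, assuming the object names used in this exercise). dplyr's left_join() requires the key columns on both sides to have compatible types, which is why the alignment is needed before joining.

```r
# Align the join key type with ORIGIN_GRID_ID / DESTIN_GRID_ID in flow_data1
honeycomb_with_busstop <- honeycomb_with_busstop %>%
  mutate(grid_id = as.factor(grid_id))
```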
I will tidy up flow_data1.
flow_data_tidy <- flow_data1 %>%
left_join(honeycomb_with_busstop,
by = c('ORIGIN_GRID_ID' = 'grid_id')) %>%
rename(ORIGIN_SCHOOL_COUNT=SCHOOL_COUNT,
ORIGIN_HDB_DWELLING_UNIT=HDB_DWELLING_UNIT,
ORIGIN_BUSINESS_COUNT=BUSINESS_COUNT,
ORIGIN_ENTERTAINMENT_COUNT=ENTERTAINMENT_COUNT,
ORIGIN_FNB_COUNT=FNB_COUNT,
ORIGIN_FINSERV_COUNT=FINSERV_COUNT,
ORIGIN_LEIREC_COUNT=LEIREC_COUNT,
ORIGIN_RETAILS_COUNT=RETAILS_COUNT,
ORIGIN_MRT_EXIT_COUNT=MRT_EXIT_COUNT)%>%
select(-c(busstop_count,area_honeycomb_grid))
flow_data_tidy <- flow_data_tidy %>%
left_join(honeycomb_with_busstop,
by = c('DESTIN_GRID_ID' = 'grid_id')) %>%
rename(DESTIN_SCHOOL_COUNT=SCHOOL_COUNT,
DESTIN_HDB_DWELLING_UNIT=HDB_DWELLING_UNIT,
DESTIN_BUSINESS_COUNT=BUSINESS_COUNT,
DESTIN_ENTERTAINMENT_COUNT=ENTERTAINMENT_COUNT,
DESTIN_FNB_COUNT=FNB_COUNT,
DESTIN_FINSERV_COUNT=FINSERV_COUNT,
DESTIN_LEIREC_COUNT=LEIREC_COUNT,
DESTIN_RETAILS_COUNT=RETAILS_COUNT,
DESTIN_MRT_EXIT_COUNT=MRT_EXIT_COUNT)%>%
  select(-c(busstop_count,area_honeycomb_grid))
I will write flow_data_tidy to local disk.
Then, I will read flow_data_tidy back into R.
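The write-and-read round trip can be sketched as below. This is a sketch; the data/rds/ path is illustrative, and write_rds()/read_rds() from readr preserve the data frame (including factor columns) so the joins above need not be repeated in later sessions.

```r
# Save the tidied flow data to local disk, then reload it
write_rds(flow_data_tidy, "data/rds/flow_data_tidy.rds")
flow_data_tidy <- read_rds("data/rds/flow_data_tidy.rds")
```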
Data Exploration on Flow_data
Distribution of trips
Trips vs distance
log(trips) vs log(dist)
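The exploratory plots above can be sketched with ggplot2 as follows. This is a sketch; TRIPS and dist are assumed to be the commuter-flow and inter-hexagon-distance columns of flow_data_tidy, and the log-log view is the usual check that flows decay roughly linearly with distance on the log scale.

```r
# Trips vs distance on the raw scale
ggplot(data = flow_data_tidy, aes(x = dist, y = TRIPS)) +
  geom_point() +
  geom_smooth(method = "lm")

# log(trips) vs log(dist): a roughly linear pattern supports
# a power-law distance-decay term in the spatial interaction model
ggplot(data = flow_data_tidy, aes(x = log(dist), y = log(TRIPS))) +
  geom_point() +
  geom_smooth(method = "lm")
```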